Skip to content

Fix race condition: subscriber misses publisher-connected notification (No such feed)#1254

Open
vladopol wants to merge 1 commit into
strukturag:masterfrom
vladopol:fix/no-such-feed-race-condition
Open

Fix race condition: subscriber misses publisher-connected notification (No such feed)#1254
vladopol wants to merge 1 commit into
strukturag:masterfrom
vladopol:fix/no-such-feed-race-condition

Conversation

@vladopol

Copy link
Copy Markdown

PR: Fix "No such feed" race condition on subscriber reconnect

Problem

When multiple clients reconnect simultaneously (e.g. after a proxy timeout or network
interruption), subscribers frequently fail to receive audio/video from publishers with
No such feed (N) error from Janus VideoRoom, leaving them unable to hear other
participants despite the call appearing active.

Root cause: lost wakeup in async.Notifier

The retry logic in sfu/janus/subscriber.go relies on newPublisherConnectedWaiter
which uses async.Notifier to wait until the publisher signals it is connected (via
webrtcup event → notifyPublisherConnected).

The race condition:

Time →
  Publisher: webrtcup fires → notifyPublisherConnected() → Notify("pub_X")
             [no waiters exist yet — notification is LOST]
  Subscriber: tries joinRoom → "No such feed"
              → creates waiter → waiter.Wait(ctx) blocks forever
              → context expires → subscriber gives up

async.Notifier.Notify() only wakes existing waiters; if none exist at the time of
notification, the event is permanently lost. Subscribers created after the publisher
connected never receive the signal and wait until context timeout.

Secondary issue: busy-loop after sticky fix

The original code creates the waiter once before the retry loop:

waiter, stop := p.mcu.newPublisherConnectedWaiter(...)
defer stop()

retry:
    // join attempt
    waiter.Wait(ctx)   // same object reused
    goto retry

If NewWaiter returns an already-closed channel (sticky behavior), subsequent
Wait() calls on the same waiter return immediately → tight busy-loop spamming
Janus with join requests until context expiry.

Fix

1. async/notifier.go — sticky notifications (30-second TTL)

Notifier now remembers the timestamp of the last Notify() call per key.
NewWaiter() checks: if the key was notified within the last 30 seconds, it returns
an already-closed channel so the caller retries immediately instead of waiting
indefinitely.

Reset() clears notifiedAt together with the waiter maps to prevent stale signals
after a publisher leaves and rejoins.

2. sfu/janus/subscriber.go — waiter created inside retry loop

Move newPublisherConnectedWaiter inside the JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED
case, creating a fresh waiter on every retry iteration:

retry:
    // join attempt
    case janus.JANUS_VIDEOROOM_ERROR_NO_SUCH_FEED:
        waiter, stop := p.mcu.newPublisherConnectedWaiter(p.publisher, p.streamType)
        if err := waiter.Wait(ctx); err != nil {
            stop()
            ...
        }
        stop()
        goto retry

This eliminates the busy-loop: each iteration creates a new waiter. If the publisher
was already connected (sticky signal within TTL), the new waiter returns immediately
and the retry join succeeds. If the publisher is not yet connected, the waiter blocks
until the signal arrives.

Observed behaviour after fix

Production system (Nextcloud Talk 23.0.4, spreed-signaling 2.1.1, Janus 1.4.0):

  • Before fix: No such feed errors in clusters of 10–44 per reconnect event,
    affecting 2–5 users per incident, occurring ~10 times per day during business hours
  • After fix: No such feed = 0 across multiple reconnect events

Testing

The race can be reproduced by:

  1. Starting a multi-participant call
  2. Forcing simultaneous WebSocket reconnect of all HPB clients
    (e.g. restart the signaling server or expire proxy timeout)
  3. Observing whether subscribers receive audio after reconnect

Without fix: some subscribers get No such feed and never receive audio.
With fix: all subscribers successfully subscribe after retry.

When subscribers reconnect simultaneously with publishers (e.g. after a
proxy timeout), the publisher's webrtcup event fires before the subscriber's
waiter is created. The notification is lost, the subscriber waits until
context timeout, and the call participant cannot hear others.

Two issues fixed:

1. async/notifier.go: add sticky notification with 30-second TTL.
   Notify() now records the notification time. NewWaiter() checks if the
   key was notified recently and returns an already-closed channel so the
   caller retries immediately instead of waiting indefinitely.

2. sfu/janus/subscriber.go: move waiter creation inside the retry loop.
   The original code created one waiter before the loop and reused it,
   causing a tight busy-loop when the waiter channel was pre-closed by
   the sticky notifier. Creating a fresh waiter per retry eliminates this.

Observed in production: clusters of 10-44 "No such feed" errors per
reconnect event affecting 2-5 users, occurring ~10x/day. After fix: 0.
@vladopol

vladopol commented Jun 2, 2026

Copy link
Copy Markdown
Author

Production validation

Running this patch in production for 5 days on a deployment with ~30 concurrent users (Nextcloud Talk 23.0.5, HPB 2.1.1, Janus 1.4.0, Talk Desktop v2.1.x on Windows):

Metric Before patch After patch (5 days)
"No such feed" errors 10–44 per reconnect event 0
Reconnect events/day ~10 observed 4 — zero errors
Users hearing silence 2–5 per event 0

The reconnect storms that triggered the race were caused by simultaneous WebSocket drops (clients that connected at similar times hit a proxy timeout together). After the fix, multiple identical reconnect events produced zero "No such feed" errors.

The 30-second TTL in notifier.go comfortably covers the typical subscriber reconnect window (observed: 1–8 seconds after publisher reconnects in our environment).

A dedicated issue with full reproduction details has been opened: #1257

@fancycode

Copy link
Copy Markdown
Member

Thanks, I will be traveling for the next two weeks and will take a look when I'm back.

@fancycode

Copy link
Copy Markdown
Member

Added an alternative fix in #1263 that doesn't need to remember the signaled events using a TTL map but keeps the notification object alive while the publisher still exists. This makes it clearer to reason about how long events are kept alive, especially in case of clients stopping and starting to publish.

Included a test that makes sure the publisher receives and signals the connected event before the subscriber waits for it (which would trigger the race without the change).

Would be great if you could test this in your environment, too. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants